International Journal of Trend in Scientific Research and Development (IJTSRD) 

Volume 3 Issue 6, October 2019 Available Online: www.ijtsrd.com e-ISSN: 2456 - 6470 


Weather Prediction Model using 
Random Forest Algorithm and Apache Spark 

Thin Thin Swe 1, Phyu Phyu 1, Sandar Pa Pa Thein 2 

1 Lecturer, Faculty of Information Science, 2 Lecturer, Faculty of Computing, 

University of Computer Studies, Pathein, Myanmar 


ABSTRACT 


One of the greatest challenges that meteorological departments face is to 
predict the weather accurately. These predictions are important because they 
influence daily life and also affect the economy of a state or even a nation. 
Weather predictions are also necessary because they form the first level of 
preparation against natural disasters, which may make the difference between 
life and death. They also help to reduce the loss of resources and to minimize 
the mitigation steps that must be taken after a natural disaster 
occurs. This research work focuses on analyzing big data algorithms that are 
suitable for weather prediction and presents a performance analysis of 
Random Forest algorithms in the Spark framework. 

KEYWORDS: Weather forecasting, Apache Spark, Random Forest algorithms 
(RF), Big Data Analysis. 


I. INTRODUCTION 

Weather forecasting has always been one of the major 
technologically and scientifically challenging problems around 
the world. This is mainly due to two factors: first, forecasts 
are used in many human activities; and second, numerous 
technological advances directly associated with this 
research field, such as the evolution of computation 
and improvements in measurement systems, have created new 
opportunities. Hence, making an accurate prediction is one of 
the major challenges that meteorologists face around the world. 
Since ancient times, weather prediction has been one of 
the most interesting and fascinating domains of study. 
Scientists have tried to forecast meteorological 
features using a number of approaches, some of which 
are more accurate than others. Weather forecasting involves 
predicting how the current state of the atmosphere will change. 
Current weather conditions are obtained from observations 
made on the ground and from aircraft, ships, 
satellites, and radars. The information is sent to 
meteorological centers, which collect, analyze, and plot 
the data in a variety of graphs and charts. Computers 
draw the lines on these graphs with the help of meteorologists, 
who correct any errors that are present. These computers 
not only produce graphs but also predict how the graphs may 
look in the near future. This estimation of weather 
by computers is known as numerical weather 
prediction [1]. To predict weather by numerical 
means, meteorologists have developed 
atmospheric models, which approximate the atmosphere 
using mathematical equations that describe how 
atmospheric conditions and rainfall change over time. 
These equations are programmed into a computer, and 
data on the current atmospheric conditions are fed 
in. The computer solves the equations to 
determine how different atmospheric variables may change 
over the forecast period. The result is known as a prognostic 
chart, which is a forecast chart drawn by the computer. 

II. PREDICTING WEATHER 

Fig. 1 shows the design of the system. Weather data are first 
collected from weather sensors and power stations. These 
data can be ingested through different data sources 
such as Kafka and Flume. In the proposed system, the data set is 
loaded through the Spark API, and the random forest algorithm 
is used for regression and classification of the weather data. 




@ IJTSRD | Unique Paper ID - IJTSRD29133 | Volume - 3 | Issue - 6 | September - October 2019 Page 549 















Figure 1: Design of the system 


A. RANDOM FORESTS MODEL 

Random Forests (RF) is one of the most popular methods in data 
mining. The method is widely used in different time series 
forecasting fields, such as biostatistics, climate monitoring, 
energy-industry planning and weather forecasting. 
RF is an ensemble learning algorithm that 
can handle both high-dimensional classification and 
regression. It is a tree-based ensemble method in which all 
trees depend on a collection of random variables. That is, the 
forest is grown from many regression trees put together, 
forming an ensemble [4]. After the individual trees in the ensemble 
are fitted using bootstrap samples, the final decision is 
obtained by aggregating over the ensemble, i.e. by averaging 
the outputs for regression or by voting for classification. This 
procedure, called bagging, improves the stability and 
accuracy of the model, reduces variance and helps to avoid 
overfitting. The bias of the bagged trees is the same as that of 
the individual trees, but the variance is decreased by 
reducing the correlation between trees (this is discussed in 
[10]). Random forests correct for decision trees' habit of 
overfitting to their training set, and their generalization error 
converges to a limiting value as more trees are added [6]. 
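
The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, and each "tree" is replaced by an idealised stand-in that returns the true value plus independent Gaussian noise. Real bagged trees are correlated, so the variance reduction they achieve is smaller than in this toy setting.

```python
import random
from collections import Counter
from statistics import mean, pvariance


def aggregate_regression(tree_outputs):
    """Average the numeric outputs of the individual trees."""
    return mean(tree_outputs)


def aggregate_classification(tree_votes):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def noisy_tree_output(true_value, rng):
    """Stand-in for one tree (assumption): true value plus independent noise."""
    return true_value + rng.gauss(0.0, 1.0)


rng = random.Random(0)
true_temperature = 30.0  # hypothetical target value
n_trees = 25
n_trials = 500

# Spread of a single noisy predictor across repeated trials.
single = [noisy_tree_output(true_temperature, rng) for _ in range(n_trials)]

# Spread of the averaged ensemble across the same number of trials.
ensemble = [
    aggregate_regression(
        [noisy_tree_output(true_temperature, rng) for _ in range(n_trees)]
    )
    for _ in range(n_trials)
]

single_variance = pvariance(single)
ensemble_variance = pvariance(ensemble)

# Majority vote over five hypothetical tree predictions.
majority = aggregate_classification(["rain", "dry", "rain", "rain", "dry"])
```

In this sketch the averaged prediction varies much less between trials than a single predictor does, which is the variance-reduction effect that bagging relies on.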

The RF generalization error is estimated by an out-of-bag 
(OOB) error, i.e. the error for training points which are not 
contained in the bootstrap training sets (about one-third of 
the points are left out in each bootstrap training set). An OOB 
error estimate is almost identical to that obtained by N-fold 
cross-validation. A major advantage of RFs is that they can 
be fitted in one pass, with cross-validation being 
performed along the way. Training can be terminated 
when the OOB error stabilizes [7]. The algorithm of RF for 
regression is shown in Figure 2 [5]. 

Figure 2. Algorithm of RF for regression [8] 


Here K represents the number of trees in the forest and F 
the number of input variables randomly chosen 
at each split. The number of trees can be 
determined experimentally: successive trees can be added 
during the training procedure until the OOB error 
stabilizes. The RF procedure is not overly sensitive to the 
value of F. The inventors of the algorithm recommend F = 
n/3 for regression RFs, where n is the number of input 
variables. Another parameter is the 
minimum node size m. The smaller the minimum node size, 
the deeper the trees. Many publications recommend m = 5, 
which is also the default value in many 
programs that implement RFs. RFs show little sensitivity 
to this parameter. 

Using RFs we can determine the prediction strength or 
importance of variables, which is useful for ranking and 
selecting variables, interpreting data and 
understanding underlying phenomena. The importance of a variable 
can be estimated in RF as the increase in prediction error when 
the values of that variable are randomly permuted across the 
OOB samples. The increase in error resulting from this 
permutation is averaged over all trees and divided by the 
standard deviation over the entire ensemble. The larger the 
increase in OOB error, the more important the variable. 
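
The permutation idea above can be illustrated with a short Python sketch. It is a simplification under stated assumptions: a single fixed model stands in for the forest, a small synthetic data set stands in for the OOB samples, and the averaging over trees and the division by the standard deviation are omitted. The feature names (humidity, station id) and the function names are hypothetical.

```python
import random


def mean_squared_error(model, rows, targets):
    """Mean squared prediction error of `model` on the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, feature, rng):
    """Increase in error after randomly permuting one feature column."""
    baseline = mean_squared_error(model, rows, targets)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    # Rebuild each row with the shuffled value in place of the original one.
    permuted = [
        r[:feature] + (v,) + r[feature + 1:] for r, v in zip(rows, column)
    ]
    return mean_squared_error(model, permuted, targets) - baseline


rng = random.Random(1)

# Synthetic held-out rows: (humidity, station_id). The target depends only
# on humidity, so station_id should turn out to be unimportant.
rows = [(rng.uniform(0, 100), rng.randint(1, 5)) for _ in range(200)]
targets = [0.5 * humidity for humidity, _ in rows]


def model(row):
    """Stand-in for a fitted forest (assumption): uses humidity only."""
    return 0.5 * row[0]


importance_humidity = permutation_importance(model, rows, targets, 0, rng)
importance_station = permutation_importance(model, rows, targets, 1, rng)
```

Permuting the humidity column raises the error, while permuting the station id leaves it unchanged, so humidity would be ranked as the more important variable.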

The original training dataset is formalized as 
$S = \{(x_i, y_j),\ i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, M\}$, 
where x is a sample and y is a feature variable of S. That is, 
the original training dataset contains N samples, and each 
sample has M feature variables. The 
main process of the construction of the RF algorithm is 
presented in Fig. 2. 



Fig.2. Process of the construction of the RF Algorithm 


The steps of the construction of the random forest algorithm 
are as follows. 

Step 1: Sampling k training subsets. 

In this step, k training subsets are sampled from the original 
training dataset S by bootstrap sampling. That is, 
N records are selected from S by random sampling with 
replacement in each sampling round. After this 
step, the k training subsets form a collection of 
training subsets $S_{train}$: 

$S_{train} = \{S_1, S_2, \ldots, S_k\}$. 

At the same time, the records that are not selected in 
each sampling round form an Out-Of-Bag (OOB) 
dataset. In this way, k OOB sets are constructed as a 
collection $S_{OOB}$: 

$S_{OOB} = \{OOB_1, OOB_2, \ldots, OOB_k\}$, 

where $k \ll N$, $S_i \cap OOB_i = \emptyset$ and $S_i \cup OOB_i = S$. To obtain the 
classification accuracy of each tree model, these OOB sets are 
used as testing sets after the training process. 
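
Step 1 can be sketched as follows. This is an illustrative Python sketch, not the authors' code: records are represented by their indices, and the values chosen for N and k are arbitrary. The chance that a given record is never drawn in N draws with replacement is $(1 - 1/N)^N$, which is close to 0.368 for large N and is consistent with the "about one-third" figure quoted earlier.

```python
import random


def bootstrap_with_oob(n_records, rng):
    """Return (in-bag indices, out-of-bag indices) for one bootstrap draw."""
    # Draw N record indices with replacement.
    in_bag = [rng.randrange(n_records) for _ in range(n_records)]
    # Records that were never drawn form the OOB set.
    oob = sorted(set(range(n_records)) - set(in_bag))
    return in_bag, oob


rng = random.Random(42)
n_records = 1000  # N, size of the original training dataset S
k = 10            # number of training subsets (k << N)

subsets = [bootstrap_with_oob(n_records, rng) for _ in range(k)]

oob_fractions = [len(oob) / n_records for _, oob in subsets]
average_oob_fraction = sum(oob_fractions) / k

# Probability that one record is missed by all N draws.
expected_oob_fraction = (1 - 1 / n_records) ** n_records
```

Each pair returned by `bootstrap_with_oob` satisfies the two set conditions stated above: the in-bag and OOB sets are disjoint, and together they cover S.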

Step 2: Constructing each decision tree model. 

In an RF model, each meta decision tree is created by the CART 
algorithm from a training subset $S_i$. In the growth process 
of each tree, m feature variables of dataset $S_i$ are randomly 
selected from the M variables. In each tree node's splitting 
process, the gain ratio of each feature variable is calculated, 
and the best one is chosen as the splitting node. This splitting 
process is repeated until a leaf node is generated. Finally, k 
decision trees are trained from the k training subsets in the 
same way. 
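
The split selection in Step 2 can be sketched in Python as below. Several choices here are assumptions rather than details from the paper: splits are binary thresholds of the form "feature <= value", the random feature subset is drawn once per call, and the sample data (humidity, pressure, wind speed) are invented for illustration.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(labels).values()
    )


def gain_ratio(rows, labels, feature, threshold):
    """Gain ratio of the binary split rows[feature] <= threshold."""
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing is useless
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = entropy(labels) - (
        w_left * entropy(left) + w_right * entropy(right)
    )
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info


def best_split(rows, labels, n_candidate_features, rng):
    """Best (feature, threshold, score) among a random subset of features."""
    candidates = rng.sample(range(len(rows[0])), n_candidate_features)
    best = (None, None, 0.0)
    for feature in candidates:
        for threshold in sorted({r[feature] for r in rows}):
            score = gain_ratio(rows, labels, feature, threshold)
            if score > best[2]:
                best = (feature, threshold, score)
    return best


# Invented records: (humidity %, pressure hPa, wind speed m/s).
rows = [
    (85, 1005, 3), (90, 1010, 7), (60, 1007, 5),
    (55, 1003, 2), (78, 1012, 6), (40, 1009, 4),
]
labels = ["rain", "rain", "dry", "dry", "rain", "dry"]

# m = 2 of the M = 3 features are considered for this node.
chosen = best_split(rows, labels, 2, random.Random(7))
```

Applying `best_split` recursively to the two halves of the data, until a node contains a single class, would give one tree of the forest.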

Step 3: Collecting k trees into an RF model. 

The k trained trees are collected into an RF model, which is 
defined in Eq. (1): 

$$H(X, \Theta_j) = \sum_{i=1}^{k} h_i(x, \Theta_j), \quad (j = 1, 2, \ldots, m) \qquad (1)$$

where $h_i(x, \Theta_j)$ is a meta decision tree classifier, X are the input 
feature vectors of the training dataset, and $\Theta_j$ is an 
independent and identically distributed random vector that 
determines the growth process of the tree. 

To explain why we selected the random forest algorithm, some of 
its benefits are listed below: 

> The random forest algorithm can be used for both 
classification and regression tasks. 

> It typically provides higher accuracy than a single decision tree. 

> The random forest classifier can handle missing values 
and maintain accuracy even when a large proportion of the data is missing. 

> Adding more trees does not cause the model to 
overfit. 

> It can handle large data sets with high 
dimensionality [3]. 

B. APACHE SPARK 

Apache Spark is a general-purpose data processing and machine 
learning engine that can be used for a variety of operations. Data 
scientists and application developers can integrate Apache Spark 
into their applications to query, analyze and transform data at scale. It 
is reported to run up to 100 times faster than Hadoop MapReduce 
for in-memory workloads. It can handle 
petabytes of data at once, distributed over a cluster of 
thousands of cooperating virtual or physical servers. Apache 
Spark is developed in Scala and supports Python, R, 
Java and, of course, Scala. It is a fast and general-purpose 
engine for large-scale data processing [9-10]. 
The Spark architecture has Spark Core at its bottom, on top 
of which the Spark SQL, MLlib, Spark Streaming and GraphX 
libraries are provided for data processing [2]. 

Fig.3. Architecture of Spark 


Apache Spark is well suited to in-memory computing. Spark 
has its own cluster manager, but it can also work with 
Hadoop. There are three core building blocks of Spark 
programming: Resilient Distributed Datasets (RDDs), 
transformations and actions. An RDD is an immutable data 
structure to which various transformations can be applied. 
Transformations are evaluated lazily: calling an action on an 
RDD triggers execution of the complete lineage of 
transformations before the result is produced. 


Fig.4. Working with RDD in Spark 
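
The relationship between transformations, actions and lineage can be illustrated with a small pure-Python toy. It is not Spark's API and does no distributed work: `LazyDataset` is a hypothetical class whose method names only mirror the RDD operations `map`, `filter` and `collect`, and the temperature readings are invented.

```python
class LazyDataset:
    """Toy stand-in for an RDD: immutable, records lineage, runs on action."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)
        self._lineage = tuple(lineage)

    def map(self, fn):
        """Transformation: returns a new dataset, computes nothing yet."""
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        """Transformation: returns a new dataset, computes nothing yet."""
        return LazyDataset(
            self._source, self._lineage + (("filter", predicate),)
        )

    def collect(self):
        """Action: replay the whole lineage over the source data."""
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every invocation of the conversion function


def to_celsius(fahrenheit):
    calls.append(fahrenheit)
    return round((fahrenheit - 32) * 5 / 9, 1)


readings_f = LazyDataset([68.0, 95.0, 104.0])

# Two transformations: neither touches the data.
celsius = readings_f.map(to_celsius)
hot = celsius.filter(lambda c: c > 30.0)
calls_before_action = len(calls)

# The action triggers execution of the full lineage.
hot_days = hot.collect()
```

The original `readings_f` object is never modified: each transformation returns a new dataset with a longer lineage, and `to_celsius` is not called until `collect` runs.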
III. CONCLUSIONS 

In this paper, a random forest algorithm has been proposed 
for big data. The accuracy of the RF algorithm is optimized 
through dimension reduction and a weighted voting 
approach. Then, a combination of data-parallel optimization 
across different data stations and task-parallel optimization is 
performed and implemented on Apache Spark. Taking advantage of the 
data-parallel optimization, the training dataset is reused and 
the volume of data is reduced significantly. Benefiting from 
the task-parallel optimization, the data transmission cost is 
effectively reduced and the performance of the algorithm is 
clearly improved. Experimental results indicate the 
superiority and notable strengths of RF over the other 
algorithms in terms of classification accuracy, performance, 
and scalability. For future work, we will focus on an 
incremental parallel random forest algorithm for data 
streams in a cloud environment, and improve the data 
allocation and task scheduling mechanism for the algorithm 
in a distributed and parallel environment. 